feat(arrow): expose Arrow IPC reader via registerArrow and readArrow #52

Open

LantaoJin wants to merge 1 commit into apache:main from LantaoJin:feat/arrow-source

Conversation

LantaoJin (Contributor) commented May 15, 2026

Which issue does this PR close?

Closes #37.

Rationale for this change

DataFusion 53.x supports Arrow IPC files via SessionContext::register_arrow / read_arrow, but the Java bindings only expose Parquet, CSV, and (in #47) NDJSON. Since JVM results already come back as Arrow batches via the C Data Interface, an Arrow IPC reader on the Java side completes the natural round-trip: Java callers can write Arrow IPC to disk with arrow-vector's ArrowFileWriter, then read it back through DataFusion without going through Parquet or any other intermediate format. Today they have to fall back to CREATE EXTERNAL TABLE … STORED AS ARROW via SQL, which works but bypasses the typed builder pattern.
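A minimal sketch of that round-trip, assuming a SessionContext is already in hand. The ArrowFileWriter calls are standard arrow-vector API and the registerArrow/readArrow overloads are the ones added by this PR; the class name, table name, path handling, and exception signature are illustrative only.

```java
import java.io.FileOutputStream;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.datafusion.DataFrame;
import org.apache.datafusion.SessionContext;

class ArrowRoundTripExample {
  // Write a populated VectorSchemaRoot to an Arrow IPC file with arrow-vector,
  // then read it back through DataFusion via the new overloads.
  static DataFrame roundTrip(SessionContext ctx, VectorSchemaRoot root, String path) throws Exception {
    try (FileOutputStream out = new FileOutputStream(path);
         ArrowFileWriter writer = new ArrowFileWriter(root, /* dictionary provider */ null, out.getChannel())) {
      writer.start();
      writer.writeBatch();   // one record batch from the current contents of `root`
      writer.end();
    }
    ctx.registerArrow("events", path);   // register as a named table for SQL, or ...
    return ctx.readArrow(path);          // ... read straight into a DataFrame
  }
}
```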

This PR is the Java surface for the existing upstream functionality. Issue #37 tracks it; the implementation follows the same proto-over-JNI pattern as #47 (NDJSON), #29 (the CSV/Parquet refactor), and the merged CSV/Parquet readers.

What changes are included in this PR?

  • proto/arrow_read_options.proto — new ArrowReadOptionsProto message. Single field: file_extension (default .arrow). Explicit Arrow schema rides on the existing IPC byte channel through the JNI layer, mirroring the parquet/csv/json paths, and is therefore not encoded in this message. No FileCompressionType field — Arrow IPC files carry body compression (LZ4_FRAME / ZSTD per-buffer) inside the file format itself.
  • ArrowReadOptions Java builder with fileExtension(String) and schema(Schema) setters.
  • SessionContext.registerArrow(name, path[, options]) and readArrow(path[, options]) overloads, structurally identical to the parquet/csv/json entry points (an options usage sketch follows this list).
  • native/src/arrow.rs — JNI module that decodes ArrowReadOptionsProto, constructs the upstream ArrowReadOptions, and forwards to register_arrow / read_arrow. Imports ArrowReadOptions from datafusion::execution::options rather than prelude (it's not re-exported there, same situation as JsonReadOptions).
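A hedged sketch of the options path referenced above: fileExtension(String) and schema(Schema) are the setters this PR describes, while the Builder entry point, build() call, field names, and paths are assumptions made for illustration, not the exact spelling of the API.

```java
import java.util.List;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;
import org.apache.datafusion.ArrowReadOptions;
import org.apache.datafusion.DataFrame;
import org.apache.datafusion.SessionContext;

class ArrowOptionsExample {
  // Non-default options: override the ".arrow" extension and supply an explicit
  // Arrow schema (skips inference; the schema travels over the IPC byte channel).
  static DataFrame readWithOptions(SessionContext ctx) throws Exception {
    Schema schema = new Schema(List.of(
        Field.nullable("id", new ArrowType.Int(64, true)),
        Field.nullable("name", new ArrowType.Utf8())));

    ArrowReadOptions options = new ArrowReadOptions.Builder()  // builder entry point assumed
        .fileExtension(".ipc")
        .schema(schema)
        .build();

    ctx.registerArrow("events", "/data/events.ipc", options);
    return ctx.readArrow("/data/events.ipc", options);
  }
}
```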

Out of scope (for follow-ups):

  • tablePartitionCols: neither parquet, csv, nor ndjson exposes Hive-style partitioning on the Java side yet. Adding it for Arrow only would diverge.

Are these changes tested?

Yes, 9 new tests across ArrowReadOptionsTest and SessionContextArrowTest.

Are there any user-facing changes?

Yes, purely additive. New public API:

  • org.apache.datafusion.ArrowReadOptions
  • SessionContext.registerArrow(String, String)
  • SessionContext.registerArrow(String, String, ArrowReadOptions)
  • SessionContext.readArrow(String) → DataFrame
  • SessionContext.readArrow(String, ArrowReadOptions) → DataFrame

The new org.apache.datafusion.protobuf.ArrowReadOptionsProto generated class is also exposed via the protobuf-Java output, consistent with how CsvReadOptionsProto, NdJsonReadOptionsProto, and ParquetReadOptionsProto are exposed. No API removals, no deprecations, no behavior change for existing callers.

Closes apache#37

Mirrors the existing parquet/csv/json reader pattern for Arrow IPC files. Adds:

- proto/arrow_read_options.proto with the ArrowReadOptionsProto message (file_extension only; explicit Arrow schema rides on the existing schema-IPC byte channel rather than this proto, matching the other formats)
- ArrowReadOptions Java builder with fileExtension default ".arrow" and schema(Schema)
- SessionContext.registerArrow(name, path[, options]) and readArrow(path[, options]) overloads with null-argument validation per Andy's apache#47 review feedback
- native/src/arrow.rs JNI module that decodes the proto and dispatches to upstream SessionContext::register_arrow / read_arrow

Note on ArrowReadOptions construction: upstream's ArrowReadOptions exposes file_extension as a public field (not a builder setter), unlike the other format options. The native side uses struct-update syntax to set it without tripping clippy's field_reassign_with_default lint.

Tests cover proto round-trip, schema-by-reference, register/read on a fixture written by arrow-vector's ArrowFileWriter (the canonical Arrow IPC file format DataFusion's source supports), custom file extension, explicit Arrow schema, and null-argument rejection on both register and read.

Out of scope: tablePartitionCols (no parquet/csv/json analog on the Java side yet). Arrow IPC carries body compression inside the file format itself, so unlike CSV and NDJSON there is no FileCompressionType on this options class.